In this lesson, we'll explore the Boston housing dataset (which is built into sklearn) and walk through some basic principles of setting up, building, tuning, and selecting a valid machine learning model.
This lesson will use sklearn in conjunction with several skutil preprocessing techniques.
In [2]:
from __future__ import print_function, division
import sklearn
import skutil
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
print('sklearn: %s' % sklearn.__version__)
print('skutil: %s' % skutil.__version__)
print('pandas: %s' % pd.__version__)
print('numpy: %s' % np.__version__)
In [3]:
from sklearn.datasets import load_boston
boston = load_boston()
X = pd.DataFrame.from_records(data=boston.data, columns=boston.feature_names)
X.head()
Out[3]:
Let's examine the first few values of our target variable. Notice the value is a real number and not a class label, indicating we'll be using regression models rather than classification.
In [4]:
y = boston.target
y[:5]
Out[4]:
By examining the dtypes (data types) attribute of the dataframe, our suspicion is confirmed: all of the features are in fact numeric. Below, we also take a look at whether there are any missing values. Luckily, in this example there are not.
In [5]:
X.dtypes
Out[5]:
In [6]:
X.isnull().sum()
Out[6]:
In [7]:
X.describe()
Out[7]:
Rarely, if ever, will your data be ready to model without any preprocessing. Whether it's noisy features, skewed variables, or redundant or uninformative features, your data will almost always need some massaging.
One thing to be aware of is whether your input is shuffled or in order; if your data is ordered, know why. For example, let's build a RandomForestRegressor on the first few rows of X and validate the model on the last rows:
In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
def rmse(act, pred):
    return np.sqrt(mean_squared_error(act, pred))
# define the model
model = RandomForestRegressor(random_state=42)
# fit the model
model.fit(X[:350], y[:350])
# assess performance
print('Train R^2: %.5f' % r2_score(y[:350], model.predict(X[:350])))
print('Train RMSE: %.5f\n' % rmse(y[:350], model.predict(X[:350])))
print('Test R^2: %.5f' % r2_score(y[350:], model.predict(X[350:])))
print('Test RMSE: %.5f' % rmse(y[350:], model.predict(X[350:])))
Notice the extreme drop-off in validation performance! There are likely phenomena in the test rows that were never observed in the training rows, so the model was never induced to capture those patterns.
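One quick way to sanity-check this (a simple diagnostic sketch, not part of the original lesson) is to compare summary statistics of the two slices; if the data is ordered, the target distribution in the first rows will often look quite different from the last rows:

# compare the target distribution in the slice we trained on vs. the slice we validated on
print('target mean, first 350 rows: %.3f' % np.mean(y[:350]))
print('target mean, remaining rows: %.3f' % np.mean(y[350:]))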
train_test_split
The sooner you can split your data, the better. sklearn provides a built-in mechanism for just this: sklearn.cross_validation.train_test_split. This will split your data to the specified sizes and shuffle the observations at the same time.
Notice we create three splits:
In [9]:
from sklearn.cross_validation import train_test_split
tr_size = int(0.6 * X.shape[0])
va_te_size = int((X.shape[0] - tr_size) / 2)
# split the train/val and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=va_te_size)
# split the train/val apart
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=va_te_size)
print('Train size: %i' % X_train.shape[0])
print('Validation size: %i' % X_val.shape[0])
print('Holdout size: %i' % X_test.shape[0])
In [10]:
# fit the model
model.fit(X_train, y_train)
# assess performance
print('Train R^2: %.5f' % r2_score(y_train, model.predict(X_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(X_train)))
print('Val R^2: %.5f' % r2_score(y_val, model.predict(X_val)))
print('Val RMSE: %.5f' % rmse(y_val, model.predict(X_val)))
Notice that our validation performance is now much more similar to our training performance.
Note: It is bad practice to evaluate your model against your test set while modeling, so we use our validation set to examine incremental performance.
How can we make this model perform better? There may be some strange/skewed distributions within our data that we could coerce into a more normal shape. Let's take a look at just a few (you could do this for all features, but for the sake of example, we'll only look at a handful).
In [11]:
# start by defining a very simple histogram function
def hist(x, scale = 1, style = 'darkgrid', left = None, right = None, xlab='Count', ylab='Y'):
x = x if not isinstance(x, pd.Series) else x.tolist()
figure = plt.figure()
sns.set(style = style)
sns.distplot(x * scale, hist = True, kde = False, norm_hist = True)
ax = figure.get_axes()[0]
ax.set_xlim(left = left or int(np.ceil(np.min(x))),
right = right or int(np.ceil(np.max(x))))
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
Notice the crime feature is quite skewed. We may be able to make it more normal with a BoxCoxTransformer. We can apply this technique to other features as well, but it is not always guaranteed to work well.
In [12]:
hist(x=X_train.CRIM, ylab='Crime rate')
In [14]:
from skutil.preprocessing import BoxCoxTransformer
hist(BoxCoxTransformer(cols=['CRIM']).fit_transform(X_train).CRIM.tolist(), ylab='Crime transformed')
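For comparison, a similar transform can be sketched with scipy's scipy.stats.boxcox, which fits the power parameter by maximum likelihood. This is only a rough equivalent: it assumes CRIM is strictly positive (which Box-Cox requires), and the fitted lambda may differ from skutil's estimate.

from scipy.stats import boxcox
# fit lambda by maximum likelihood and transform the raw crime rates
crim_bc, crim_lambda = boxcox(X_train.CRIM.values)
hist(crim_bc, ylab='Crime (scipy Box-Cox)')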
How do we know which features to retain? In this toy example we have a manageable number of features, but in domains like text analytics or computer vision we often have more than 100,000. Let's explore some techniques for reducing this high dimensionality without hurting the predictive power of our model (in no particular order):
1. Eliminate multicollinearity:
In [15]:
from skutil.feature_selection import MulticollinearityFilterer
# let's see if any features are collinear with one another:
fltr = MulticollinearityFilterer(threshold=0.9).fit(X_train)
# examine the drop attribute
fltr.drop
Out[15]:
The MulticollinearityFilterer searches the correlation matrix for any correlations greater than the provided threshold. When a high correlation is observed between two features, it compares each feature's mean absolute correlation with the remaining features and drops the one that is more highly correlated overall.
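To make the mechanics concrete, here is a naive pandas sketch of that idea (an illustration only, not skutil's actual implementation; naive_mc_filter is a hypothetical helper):

def naive_mc_filter(frame, threshold=0.9):
    # pairwise absolute correlations between features
    corr = frame.corr().abs()
    drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                # drop whichever of the pair has the higher
                # mean absolute correlation overall
                drop.add(a if corr[a].mean() > corr[b].mean() else b)
    return sorted(drop)

naive_mc_filter(X_train, threshold=0.9)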
2. Eliminate features with near zero variance:
In [16]:
from skutil.feature_selection import NearZeroVarianceFilterer
# define and fit the filterer
fltr = NearZeroVarianceFilterer(threshold=1e-4).fit(X_train)
# examine the dropped cols
fltr.drop
Notice there are no features with variance less than the threshold, so the result was None. If we wanted, we could adjust that threshold.
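For reference, a similar check can be sketched with plain sklearn's VarianceThreshold. This is a comparison only: it operates on numpy arrays rather than DataFrames, and its variance calculation may differ slightly from skutil's.

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=1e-4).fit(X_train)
# names of any columns whose variance falls below the threshold
X_train.columns[~vt.get_support()].tolist()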
3. PCA (Principal Component Analysis)
(Note that this isn't actually a feature selection technique, but a feature reduction technique that results in a set of features which are linear combinations of the original input space)
In [17]:
from skutil.decomposition import SelectivePCA
# define and fit
pca = SelectivePCA(n_components=0.85).fit(X_train)
# examine the head
pca.transform(X_train).head()
Out[17]:
In [20]:
from skutil.preprocessing import SelectiveScaler
# multicollinearity
mcf = MulticollinearityFilterer(threshold=0.9).fit(X_train)
mcf_train = mcf.transform(X_train)
# near zero variance
nzv = NearZeroVarianceFilterer(threshold=1e-4).fit(mcf_train)
nzv_train = nzv.transform(mcf_train)
# add a step: scaling
scl = SelectiveScaler().fit(nzv_train)
scl_train = scl.transform(nzv_train)
# pca
pca = SelectivePCA(n_components=0.85).fit(scl_train)
pca_train = pca.transform(scl_train)
# fit the model
model.fit(pca_train, y_train)
# assess performance on validation set
print('Train R^2: %.5f' % r2_score(y_train, model.predict(pca_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(pca_train)))
That's nice, but it's kind of a mess. What if we have a ton of preprocessors to keep track of? Things could get hairy. Furthermore, if we want to assess performance on our validation set, we have to recompute just as many intermediate transformations. What a pain!
That's what the Pipeline object is for. A Pipeline stores a sequence of named transformers, with an optional estimator as the last element. The only required argument to the Pipeline constructor is a single list of (name, transformer) tuples:
pipe = Pipeline([
    ('name_of_first_step', FirstTransformer()),
    ('name_of_second_step', SecondTransformer())
])
In [21]:
from sklearn.pipeline import Pipeline
# define our pipe
pipe = Pipeline([
    ('mc', MulticollinearityFilterer(threshold=0.9)),
    ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA(n_components=0.85)),
    ('rf', RandomForestRegressor(random_state=42))
])
# fit our pipeline
pipe.fit(X_train, y_train)
# assess performance
print('Train R^2: %.5f' % r2_score(y_train, pipe.predict(X_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, pipe.predict(X_train)))
print('Validation R^2: %.5f' % r2_score(y_val, pipe.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, pipe.predict(X_val)))
Notice we get the exact same results, but the code is much more elegant with fewer intermediate variables lying around. We can also assess performance on our validation set.
However, on closer inspection, we can see that our results are not as good as they were before preprocessing. Presumably, if we could tune our preprocessing hyperparameters to better suit the algorithm, we could identify a model with better performance. Furthermore, the astute reader will note that we have not yet introduced any cross validation:
In [22]:
from sklearn.cross_validation import KFold
# the default sklearn cross validation does NOT shuffle, and you know how we feel about that...
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
Now we introduce the randomized grid search, the mechanism by which we will sample hyperparameter combinations from a specified space, building cross-validated models at each iteration and retaining the one that performs best.
In [23]:
# make sure to use the SKUTIL grid search for DataFrame compatibility, and not the SKLEARN one.
from skutil.grid_search import RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
# define our pipe -- note we do not include a Box-Cox step
pipe = Pipeline([
    ('mc', MulticollinearityFilterer()),
    ('nzv', NearZeroVarianceFilterer()),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA()),
    ('rf', RandomForestRegressor(random_state=42))
])
# let's define the hyperparameters we'll search over. Notice the form of:
# '<stage_nm>__<arg_nm>'
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(4, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(50, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 10),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(10, 50)
}
# define and fit
search = RandomizedSearchCV(pipe,
                            hyperparams,
                            cv=custom_cv,
                            scoring='r2',
                            random_state=42,
                            n_iter=30)
search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, search.predict(X_val)))
We can actually view our grid results like so:
In [24]:
from skutil.utils import report_grid_score_detail
report_grid_score_detail(random_search=search, charts=True)
Out[24]:
In viewing these results, we can make educated decisions on refining our grid such that we don't waste time searching over parameters that detrimentally impact the performance.
In [25]:
# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
# define our pipe
pipe = Pipeline([
    ('mc', MulticollinearityFilterer()),
    ('nzv', NearZeroVarianceFilterer()),
    ('sc', SelectiveScaler(scaler=StandardScaler())),  # default scaler; the grid still searches over several
    ('pca', SelectivePCA()),
    ('rf', RandomForestRegressor(random_state=42))
])
# we can narrow our search parameters now
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(8, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(75, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 8),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(25, 50)
}
# define and fit
search = RandomizedSearchCV(pipe,
                            hyperparams,
                            cv=custom_cv,
                            scoring='r2',
                            random_state=42,
                            n_iter=30)
search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, search.predict(X_val)))
In [26]:
from sklearn.ensemble import GradientBoostingRegressor
# define our pipe
gbm_pipe = Pipeline([
    ('mc', MulticollinearityFilterer(threshold=0.9)),
    ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA(n_components=0.85)),
    ('gbm', GradientBoostingRegressor(random_state=42))
])
# let's define the hyperparameters we'll search over.
gbm_hyperparams = {
    'mc__threshold'      : uniform(0.80, 0.15),
    'sc__scaler'         : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'  : uniform(0.95, 0.05),
    'gbm__n_estimators'  : randint(90, 200),
    'gbm__learning_rate' : uniform(0.075, 0.05),
    'gbm__max_depth'     : randint(2, 7),  # we grow these trees more shallow
}
# define and fit
gbm_search = RandomizedSearchCV(gbm_pipe,
                                gbm_hyperparams,
                                cv=custom_cv,
                                scoring='r2',
                                random_state=42,
                                n_iter=30)
gbm_search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, gbm_search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, gbm_search.predict(X_val)))
Wow, that looks fantastic! However, GBMs are much more likely to overfit the data. We'll need to see how it performs on the holdout set to determine whether it's actually a good model.
This happens only once: when you've built a selection of candidate models, evaluate each of them a single time against the holdout set to make your final model selection.
In [27]:
# examine RF performance
print('RF test R^2: %.5f' % r2_score(y_test, search.predict(X_test)))
print('RF test RMSE: %.5f\n' % rmse(y_test, search.predict(X_test)))
# examine GBM performance
print('GBM test R^2: %.5f' % r2_score(y_test, gbm_search.predict(X_test)))
print('GBM test RMSE: %.5f' % rmse(y_test, gbm_search.predict(X_test)))
For our RF, the validation error closely resembles the holdout error, which suggests we are not overfitting. Our GBM, however, indicates otherwise. We could tune its hyperparameters to address this, but for the sake of example, we won't in this demo.
See the next demo for information on how to persist models to disk.
In [ ]: